

<#>MIT Project: From Lines to Networks

<##>Predicting HDB Resale Prices with Linear Regression and MLP

Mathematical Investigative Task (MIT) 2026
Hwa Chong Institution | H2 Mathematics

<##>Project Overview

This project demonstrates the mathematical foundations of machine learning by implementing Linear Regression and a Multi-Layer Perceptron (MLP) from scratch using only NumPy. We apply these models to predict HDB resale prices in Singapore, showcasing real-world applications of H2 Mathematics concepts.

<###>Theme

SDG 11 (Sustainable Cities): understanding housing affordability through data-driven analysis

<##>Mathematical Concepts Demonstrated

<###>Linear Regression

Statistics & Regression: Correlation, R² score, mean squared error
Calculus: Gradient descent optimization, partial derivatives
Linear Algebra: Matrix operations, normal equation
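
As an illustration of the normal equation listed above, here is a minimal NumPy sketch. The function name `fit_normal_equation` and the toy data are assumptions made for this example; they are not taken from `linear_regression.py`. The sketch calls `np.linalg.lstsq`, which minimises the same squared-error objective as the normal equation (XᵀX)θ = Xᵀy without forming an explicit inverse.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Least-squares fit of y ≈ X @ w + b. Returns (b, w)."""
    # Prepend a column of ones so the bias b is learned as theta[0].
    X_b = np.column_stack([np.ones(len(X)), X])
    # lstsq solves the same least-squares problem as (X_b^T X_b) theta = X_b^T y
    # and still returns a solution when columns are collinear.
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return theta[0], theta[1:]

# Toy data (made up for illustration): y = 3 + 2*x1 - 1*x2 with no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

b, w = fit_normal_equation(X, y)
print("bias:", b, "weights:", w)
```

Because the toy data contains no noise, the recovered bias and weights should match the generating values up to floating-point error.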

<###>Multi-Layer Perceptron (MLP)

Calculus: Chain rule for backpropagation
Functions & Graphs: Activation functions (ReLU)
Linear Algebra: Matrix-vector multiplication
Optimization: Gradient descent with weight updates

<##>Dataset

Source: Singapore Government Data (data.gov.sg)
File: Resale flat prices based on registration date from Jan-2017 onwards
Size: 229,273 transactions

<###>Features Engineered

| Feature | Description | Mathematical Role |
|---------|-------------|-------------------|
| floor_area_sqm | Flat size in square meters | Input variable |
| remaining_lease_years | Years left on lease | Input variable |
| storey_mid | Middle of storey range | Input variable |
| town_* | One-hot encoded towns | Categorical features |
| flat_* | One-hot encoded flat types | Categorical features |
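
To make two of the engineered features concrete, the sketch below shows one way `storey_mid` and the one-hot `town_*` columns could be derived. It assumes the raw storey range is a string such as `"10 TO 12"`; the helper names `storey_mid` and `one_hot` are hypothetical and are not taken from `data_loader.py`.

```python
import numpy as np

def storey_mid(storey_range):
    """Midpoint of a storey range string, e.g. '10 TO 12' -> 11.0.
    The 'LOW TO HIGH' format is an assumption about the raw data."""
    low, high = storey_range.split(" TO ")
    return (int(low) + int(high)) / 2

def one_hot(values, prefix):
    """Return (column_names, matrix) with one 0/1 column per distinct value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    matrix = np.zeros((len(values), len(categories)))
    for row, value in enumerate(values):
        matrix[row, index[value]] = 1.0
    names = [f"{prefix}_{c}" for c in categories]
    return names, matrix

# Small illustrative input (not real transactions).
towns = ["BEDOK", "ANG MO KIO", "BEDOK"]
names, town_matrix = one_hot(towns, "town")
print(storey_mid("10 TO 12"))
print(names)
print(town_matrix)
```

Each row of the one-hot matrix contains exactly one 1, so a town's weight acts as an additive price offset in a linear model.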

<##>Results

<###>Linear Regression on Real HDB Data

R² (Test): 0.52
Interpretation: the model explains about 52% of the variance in resale prices on the test set
Key Findings:

- Floor area: +$2,000 per sqm
- Remaining lease: +$37,000 per year
- Higher floors: +$55,000 premium
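
For reference, R² compares the model's squared error with that of always predicting the mean price. A minimal sketch follows; the function name `r2_score` and the numbers are illustrative assumptions, not project output.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # error left after the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up prices in thousands of dollars.
y_true = np.array([300.0, 400.0, 500.0, 600.0])

perfect = r2_score(y_true, y_true)                       # predictions equal targets
baseline = r2_score(y_true, np.full(4, y_true.mean()))   # always predict the mean
print(perfect, baseline)
```

A perfect model scores 1 and the mean-only baseline scores 0, so an R² of 0.52 sits roughly halfway between the two.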

<###>Why Linear Regression Works Well

HDB pricing has mostly linear relationships:

Larger flats cost more (linear with area)
Longer leases cost more (linear with years)
Location premiums are roughly additive

<###>MLP on Non-Linear Synthetic Data

R² (Test): 0.97
Relationship: y = x² + sin(x) + noise
Finding: MLP captures non-linear patterns linear models miss
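
The synthetic relationship above can be generated as in the sketch below. The input range [-3, 3], the noise standard deviation 0.3, and the sample size are assumptions chosen for illustration; the source does not state the values used in `mlp.py`. A straight-line fit is included to show how little of this curve a linear model explains.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500                                   # assumed sample size
x = rng.uniform(-3.0, 3.0, size=n)        # assumed input range
noise = rng.normal(0.0, 0.3, size=n)      # assumed noise level
y = x**2 + np.sin(x) + noise

# Best straight line through the data, for comparison.
slope, intercept = np.polyfit(x, y, deg=1)
y_line = slope * x + intercept
r2_line = 1.0 - np.sum((y - y_line) ** 2) / np.sum((y - y.mean()) ** 2)
print("R² of straight-line fit:", r2_line)
```

On a symmetric range the x² term has no linear trend, so the line can only pick up the small sin(x) component and its R² stays low.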

<###>Comparison on Real HDB Data

| Model | R² | Strengths | Weaknesses |
|-------|-----|-----------|------------|
| Linear Regression | 0.52 | Interpretable, fast | Can't learn curves |
| MLP | ~0.50* | Flexible, powerful | More complex, needs tuning |

*MLP performance depends heavily on hyperparameters. With proper tuning, it can match or exceed linear regression on this dataset.

<##>Project Structure


~/MIT_Project/
├── data/
│   ├── Resale flat prices based on registration date from Jan-2017 onwards.csv (229K rows)
│   └── [4 other CSV files]
├── models/
│   ├── linear_regression.py          # Linear regression from scratch
│   ├── mlp.py                         # MLP from scratch
│   ├── data_loader.py                 # HDB data preprocessing
│   ├── train_hdb_linear.py          # Train LR on HDB data
│   ├── train_hdb_comparison.py     # Compare LR vs MLP
│   ├── linear_regression_params.json # Saved model weights
│   └── hdb_comparison_results.json   # Comparison results
├── manim/
│   ├── mit_animations.py            # Manim animation scripts
│   └── create_animations.py         # Matplotlib fallback
├── notebooks/
│   └── [For Jupyter exploration]
└── README.md                       # This file

<##>How to Run

<###>1. Linear Regression Demo


cd ~/MIT_Project/models
python3 linear_regression.py

<###>2. Train on Real HDB Data


python3 train_hdb_linear.py

<###>3. Compare LR vs MLP


python3 train_hdb_comparison.py

<###>4. MLP on Synthetic Data


python3 mlp.py

<##>Animation Storyboard

<###>Scene 1: Title

"From Lines to Networks: How Machines Learn to Predict HDB Resale Prices"

<###>Scene 2: Data Visualization

Scatter plot of HDB transactions (floor area vs. price)

<###>Scene 3: Linear Regression Model

Equation: ŷ = w₁x₁ + w₂x₂ + ... + b
Loss function: MSE = (1/n) Σ(y - ŷ)²
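
The two formulas in this scene translate directly into NumPy. The sketch below uses made-up weights and two made-up flats; the names `predict` and `mse` are illustrative and are not claimed to match the project code.

```python
import numpy as np

def predict(X, w, b):
    """ŷ = w₁x₁ + w₂x₂ + ... + b for every row of X."""
    return X @ w + b

def mse(y, y_hat):
    """MSE = (1/n) Σ (y - ŷ)²."""
    return np.mean((y - y_hat) ** 2)

# Made-up example: columns are floor area (sqm) and remaining lease (years).
X = np.array([[90.0, 60.0],
              [70.0, 80.0]])
w = np.array([2.0, 1.0])      # illustrative weights
b = 10.0                      # illustrative bias
y = np.array([250.0, 235.0])  # illustrative targets

y_hat = predict(X, w, b)      # [250.0, 230.0]
loss = mse(y, y_hat)          # (0² + 5²) / 2 = 12.5
print(y_hat, loss)
```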

<###>Scene 4: Gradient Descent

Animation showing weight updates converging to the minimum of the loss
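
The update rule this scene animates can be sketched as follows, using the MSE gradients ∂L/∂w = (2/n) Xᵀ(ŷ − y) and ∂L/∂b = (2/n) Σ(ŷ − y). The learning rate, epoch count, and toy data are assumptions for illustration only.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on MSE for the model ŷ = X @ w + b."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    history = []                          # loss before each update
    for _ in range(epochs):
        error = X @ w + b - y             # ŷ - y
        history.append(np.mean(error ** 2))
        grad_w = (2.0 / n) * (X.T @ error)
        grad_b = 2.0 * np.mean(error)
        w -= lr * grad_w                  # step against the gradient
        b -= lr * grad_b
    return w, b, history

# Toy data (made up): y = 1.5*x1 - 0.5*x2 + 4 with no noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 4.0

w, b, history = gradient_descent(X, y)
print(w, b, history[0], history[-1])
```

Plotting `history` against the epoch number gives the kind of descending loss curve the animation would show. The toy features here are already on a similar scale; features measured in very different units usually converge much more slowly unless they are rescaled first.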

<###>Scene 5: Linear Regression Result

Best-fit line on HDB data with R² = 0.52

<###>Scene 6: The Non-Linear Problem

Linear model failing on curved data

<###>Scene 7: MLP Architecture

Network diagram: Input → Hidden → Output

<###>Scene 8: Forward Pass

Data flows through network: z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾, a⁽¹⁾ = ReLU(z⁽¹⁾)
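
A minimal sketch of this forward pass for a single input vector is given below. The layer sizes (2 inputs, 3 hidden units, 1 output), the linear output layer, and all weight values are assumptions for illustration; the architecture in `mlp.py` may differ.

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    """One hidden layer: z1 = W1 x + b1, a1 = ReLU(z1), ŷ = W2 a1 + b2."""
    z1 = W1 @ x + b1
    a1 = relu(z1)
    y_hat = W2 @ a1 + b2      # linear output, as is usual for regression
    return z1, a1, y_hat

# Illustrative parameters (not trained values).
W1 = np.array([[1.0, -1.0],
               [0.5,  0.5],
               [-1.0, 0.0]])
b1 = np.array([0.0, -1.0, 0.5])
W2 = np.array([[2.0, 1.0, -1.0]])
b2 = np.array([0.1])
x = np.array([2.0, 1.0])

z1, a1, y_hat = forward(x, W1, b1, W2, b2)
print(z1)      # [ 1.   0.5 -1.5]
print(a1)      # [1.  0.5 0. ]  (the negative entry is clipped to 0)
print(y_hat)   # approximately [2.6]
```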

<###>Scene 9: Backpropagation

Chain rule visualization: ∂L/∂w = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂w
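
The chain rule in this scene can be checked numerically. The sketch below assumes a one-hidden-layer network with a linear output and squared-error loss L = (ŷ − y)² on a single sample, so ∂ŷ/∂z passes through the hidden activation a = ReLU(z). All parameter values and the function name `loss_and_grads` are illustrative. The analytic gradient for the first-layer weights is compared against a central finite difference.

```python
import numpy as np

def loss_and_grads(x, y, W1, b1, W2, b2):
    """Forward pass, then backpropagation via the chain rule."""
    # Forward pass
    z1 = W1 @ x + b1
    a1 = np.maximum(0.0, z1)
    y_hat = (W2 @ a1 + b2)[0]
    loss = (y_hat - y) ** 2
    # Backward pass, one chain-rule factor at a time
    dL_dyhat = 2.0 * (y_hat - y)              # ∂L/∂ŷ
    dL_dW2 = dL_dyhat * a1[np.newaxis, :]     # ∂ŷ/∂W2 = a1
    dL_db2 = np.array([dL_dyhat])
    dL_da1 = dL_dyhat * W2[0]                 # ∂ŷ/∂a1 = W2
    dL_dz1 = dL_da1 * (z1 > 0)                # ReLU'(z) is 1 if z > 0, else 0
    dL_dW1 = np.outer(dL_dz1, x)              # ∂z1/∂W1 = x
    dL_db1 = dL_dz1
    return loss, dL_dW1, dL_db1, dL_dW2, dL_db2

# Illustrative parameters (not trained values).
W1 = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]])
b1 = np.array([0.0, -1.0, 0.5])
W2 = np.array([[2.0, 1.0, -1.0]])
b2 = np.array([0.1])
x = np.array([2.0, 1.0])
y = 1.0

loss, dW1, db1, dW2, db2 = loss_and_grads(x, y, W1, b1, W2, b2)

# Numerical gradient of the loss with respect to each entry of W1.
eps = 1e-6
numeric_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        loss_plus = loss_and_grads(x, y, W_plus, b1, W2, b2)[0]
        loss_minus = loss_and_grads(x, y, W_minus, b1, W2, b2)[0]
        numeric_dW1[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_gap = np.max(np.abs(numeric_dW1 - dW1))
print("analytic dL/dW1:\n", dW1)
print("largest difference from numerical gradient:", max_gap)
```

The third hidden unit has a negative pre-activation, so ReLU blocks its gradient and the corresponding row of ∂L/∂W⁽¹⁾ is zero.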

<###>Scene 10: Comparison

Side-by-side: Linear Regression (R² = 0.52 on real HDB data) vs MLP (R² = 0.97 on non-linear synthetic data)

<###>Scene 11: Conclusion

Mathematics powers machine learning: statistics, calculus, linear algebra

<##>Key Takeaways

Linear Regression is powerful for linear relationships and highly interpretable
MLP can learn non-linear patterns but requires more tuning
Feature Engineering matters: creating meaningful inputs from raw data
Gradient Descent is the engine that makes learning possible
Real-world data validates theoretical understanding

<##>AI Use Declaration

AI Tool Used: ChatGPT (Claude Code / Hermes Agent)

Purpose:

Code structure and debugging assistance
Animation script templates
Mathematical formula verification

Original Contribution:

All mathematical derivations verified by hand
Model architectures designed based on coursework
Data interpretation and analysis
Final presentation and explanation

Note: All core algorithms (linear regression, MLP, backpropagation) implemented from scratch without using machine learning libraries (scikit-learn, PyTorch, TensorFlow).

<##>References

HDB Resale Price Data: https://data.gov.sg
SDG 11: Sustainable Cities and Communities (UN)
H2 Mathematics Syllabus (2027): Statistics, Calculus, Linear Algebra

Submitted: Term 2 Week 10, 2026
Group Members: [Your names here]
Class: 26S6B